Introduction

Objective

About The Dataset

Main Instructions

Load and Describe Data

In this project, we work with 3 versions of the dataset:

  1. data is the original data, used for exploratory data analysis
  2. data_clean is the data after handling missing values and duplicates, used from data cleaning through feature engineering & encoding
  3. df_pre is the final dataset used for modeling. This dataset has already been processed with log transformation/normalization

Load Data

Describe Data

Conclusion

Data Understanding: Exploratory Data Analysis (EDA)

Statistical Summary

Separation of Categorical & Numerical Data

Statistical Numerical Data

Conclusion from Statistical Numerical Data

Statistical Categorical Data

Categorical Data Conclusion

Check Missing Value

education previous_year_rating employee_id

There are 2 features with missing values, 1 numeric and 1 non-numeric. The education feature (non-numeric) has 4.40% missing values, and the previous_year_rating feature (numeric) has 7.52% missing values. Compared with the total number of rows, the number of missing values is relatively small.
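The percentages above can be reproduced with a short pandas summary. This is a minimal sketch on a toy stand-in for data (the values below are invented for illustration, not taken from the real dataset); only the column names come from the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the real project loads the full dataset instead.
data = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's", "Bachelor's"],
    "previous_year_rating": [3.0, 5.0, np.nan, np.nan, 4.0],
})

# Count missing values per column and express them as a share of all rows.
missing_count = data.isnull().sum()
missing_pct = (missing_count / len(data) * 100).round(2)

# Keep only the columns that actually contain missing values.
missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct,
})
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```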

Check Cardinality or Unique Value from Categorical Data

Check Duplicated Data

In the raw data, there are no duplicate values
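A duplicate check of this kind is usually a one-liner in pandas. The sketch below uses a small invented frame in place of data; on the real dataset the same call is expected to return 0.

```python
import pandas as pd

# Toy stand-in for `data`; values are invented for illustration.
data = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department": ["HR", "HR", "Finance"],
})

# duplicated() flags rows that repeat an earlier row across all columns.
n_duplicates = int(data.duplicated().sum())
print(f"Duplicated rows: {n_duplicates}")
```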

Early Brief

Graphical Approach

Univariate Analysis

Analyze each feature separately and look at its distribution in detail

Numerical Data

Distribution of Promoted Employees based on age

Probability to Get Promotion based on KPIs_met >80%

Probability to Get Promotion based on previous_year_rating

Probability to Get Promotion based on awards_won?

Distribution & Probability of Promoted Employees based on length_of_service

Probability to Get Promotion based on no_of_trainings

Distribution & Probability of Promoted Employees based on avg_training_score

Distribution

Non - Numerical (Categorical)

Multivariate Analysis

Numerical Data

Non Numerical Data (Categorical)

Probability to Get Promotion based on region

Probability to Get Promotion based on department

Probability to Get Promotion based on education

Probability to Get Promotion based on gender

Probability to Get Promotion based on recruitment_channel

Probability to Get Promotion based on Gender and KPI
Probability to Get Promotion based on Average_Training_Score and KPI
Probability to Get Promotion based on Department and KPI
Probability to Get Promotion based on Department and Gender
Probability to Get Promotion based on Award and KPI

Insight from Original Data

These early insights are obtained from the original data (before any pre-processing). They will be used as ideas for data pre-processing

  1. There are 2 features with missing values, 1 numerical and 1 non-numerical (categorical). previous_year_rating (numerical) has 7.52% (4124 rows) missing values. education (non-numerical) has 4.40% (2409 rows) missing values. Relative to the size of the dataset, the amount of missing data is small
    Click distribution_of_missing_value for more details
  2. Most promoted employees are in the 25-30 age range (is_promoted >= 100).
    Click promoted_by_age to see the visualization
  3. Employees who met their KPI are promoted more often. Even so, some employees who did not meet their KPI were still promoted.
    Click promoted_by_kpi to see the visualization
  4. Employees with a rating >= 3 are promoted more often. However, some employees with a rating < 3 were still promoted.
    Click previous_year_rating to see the visualization
  5. Awards do not strongly affect promotion. In fact, most promoted employees are those without an award.
    Click awards_won? to see the visualization
  6. Most promoted employees have a length of service between 1 and 10 years (is_promoted >= 100).
    Click length_of_service to see the visualization
  7. A high number of trainings does not automatically lead to a promotion. In fact, among employees with >= 5 trainings, almost none were promoted.
    Click no_of_trainings to see the visualization
  8. Given each employee's number of trainings, the average score obtained is also not strongly tied to promotion. The distribution of average scores shows no significant difference in the number of promoted employees.
    Click avg_training_score to see the visualization
  9. Of the 5 categorical features, region has the most unique values, namely 34.
    Click unique_value to see the visualization
  10. Promoted employees are not evenly distributed across departments. The department with the most promoted employees is Sales & Marketing.
    Click promoted_by_department to see the visualization
  11. Most promoted employees have a bachelor's degree.
    Click promoted_by_education to see the visualization
  12. Most promoted employees are male.
    Click promoted_by_gender to see the visualization

Data Preparation

Fix The Duplicate Values

There are no duplicate values, so no rows need to be removed

Fix The Missing Value

Remove employee_id, since every row has a unique value

From this step on, the dataset is stored in a new object called data_clean

Fill the missing values in the education feature with the most frequently occurring value, obtained with the mode() function

Fill the missing values in the previous_year_rating feature with the median, obtained with the median() function. The reason is that the distribution of this feature looks close to normal (the mean and the median are close to each other)
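The cleaning steps above can be sketched as follows. The frame below is a toy stand-in with invented values; only the column names and the choice of mode() and median() come from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original `data`; values are invented for illustration.
data = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "education": ["Bachelor's", None, "Bachelor's", "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, np.nan, 5.0, 3.0, 4.0],
})

# Drop the identifier column and continue on a separate object.
data_clean = data.drop(columns=["employee_id"]).copy()

# Categorical feature: fill with the most frequent value.
education_mode = data_clean["education"].mode()[0]
data_clean["education"] = data_clean["education"].fillna(education_mode)

# Numerical feature: fill with the median.
rating_median = data_clean["previous_year_rating"].median()
data_clean["previous_year_rating"] = data_clean["previous_year_rating"].fillna(rating_median)

print(data_clean)
```

Working on a copy keeps data untouched, so the exploratory plots can still be regenerated from the original values.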

Feature Engineering

Based on the dataset, we decided to create 4 new features from the existing features:

  1. potential_region: indicates whether an employee is placed in a region with a high potential for promotion
  2. performance_level: indicates the performance level of an employee. This feature is a combination of previous_year_rating, KPIs_met >80%, and awards_won?
  3. High_Avg_Tscore: indicates whether the average training score of an employee is high enough for a promotion chance, based on the previous data
  4. male: indicates the gender of an employee

potential_region

Indicates whether an employee is placed in a region with a high potential for promotion.

  1. Based on the probability information in promoted_by_region, 8 regions have a higher potential than the others, namely a probability value > 10%
  2. The regions are 4, 17, 25, 28, 23, 22, 3, 7 (ordered from the highest probability value)
  3. If the employee is in one of these regions, the feature has the value '1'; otherwise it has the value '0'
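A minimal sketch of this flag is shown below. It assumes the region column stores labels such as region_4; if the column holds bare numbers instead, the lookup set should contain the numbers directly. The sample rows are invented.

```python
import pandas as pd

# Regions listed above as having a promotion probability above 10%.
potential_region_ids = [4, 17, 25, 28, 23, 22, 3, 7]

# Assumption: region values look like "region_4" in the dataset.
potential_region_labels = {f"region_{i}" for i in potential_region_ids}

# Toy stand-in for `data_clean`.
data_clean = pd.DataFrame({
    "region": ["region_4", "region_2", "region_7", "region_31"],
})

# 1 if the employee is in a high-potential region, 0 otherwise.
data_clean["potential_region"] = (
    data_clean["region"].isin(potential_region_labels).astype(int)
)
print(data_clean)
```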

performance_level

This feature is a combination of previous_year_rating, KPIs_met >80%, and awards_won?

Based on the probabilities obtained in promoted_by_kpi, previous_year_rating, and awards_won?, the probability of promotion is higher if an employee meets these 3 conditions:

  1. An employee who met their KPI > 80% has a higher chance of promotion than those who did not meet their KPI
  2. An employee who got a rating of 5 in the previous year has a 9% higher chance of promotion than others
  3. An employee who won any award in the previous year has a 36% higher chance of promotion than others

This feature will have performance level as its value, from 1 (low) to 4 (best), with the provisions below:

High_Avg_Tscore

This feature indicates whether an employee has an average training score of 90 or more, which is associated with a greater chance of promotion

  1. Based on the probability information in avg_training_score, the chance of promotion for employees with a value greater than 90 is 76.83%
  2. If the employee has such a score, the feature has the value '1'; otherwise it has the value '0'
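The flag can be derived with a single comparison. This sketch assumes the threshold is inclusive (score >= 90), following the feature description; the sample scores are invented.

```python
import pandas as pd

# Toy stand-in for `data_clean`.
data_clean = pd.DataFrame({"avg_training_score": [49, 90, 95, 89]})

# Assumption: the cut-off is inclusive, i.e. a score of exactly 90 counts as high.
HIGH_SCORE_THRESHOLD = 90

data_clean["High_Avg_Tscore"] = (
    data_clean["avg_training_score"] >= HIGH_SCORE_THRESHOLD
).astype(int)
print(data_clean)
```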

male

Indicates the gender of an employee, derived from the gender feature

  1. If the gender value is m, the male feature has the value 1; otherwise the value is 0
  2. This keeps gender in a single column, so feature encoding does not increase the number of features
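As a rough illustration, the binary column can be built directly from gender and the original column dropped afterwards. Dropping gender is an assumption made here to keep the sketch tidy; the sample rows are invented.

```python
import pandas as pd

# Toy stand-in for `data_clean`.
data_clean = pd.DataFrame({"gender": ["m", "f", "m", "f"]})

# 1 for male employees, 0 otherwise.
data_clean["male"] = (data_clean["gender"] == "m").astype(int)

# Assumption: the original column is no longer needed once `male` exists.
data_clean = data_clean.drop(columns=["gender"])
print(data_clean)
```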

Feature Encoding

Feature encoding converts the categorical data into a numerical format so that the model can learn from it effectively. The technique we chose is one-hot encoding, because the cardinality of the categorical features is not too large
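One common way to apply one-hot encoding in pandas is pd.get_dummies. The sketch below encodes two of the listed features on invented sample rows; whether the project used get_dummies or another encoder is not stated, so treat this as one possible approach.

```python
import pandas as pd

# Toy stand-in for `data_clean`; values are invented for illustration.
data_clean = pd.DataFrame({
    "department": ["Sales & Marketing", "Technology", "Sales & Marketing"],
    "recruitment_channel": ["sourcing", "other", "referred"],
})

# Each category becomes its own 0/1 column, prefixed with the feature name.
encoded = pd.get_dummies(
    data_clean,
    columns=["department", "recruitment_channel"],
    dtype=int,
)
print(encoded.columns.tolist())
```

The number of new columns equals the number of distinct categories per feature, which is why this approach suits low-cardinality features.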

department

education

recruitment_channel

performance_level

Log Transformation

From this step on, we use df_pre as the dataset. It starts as a copy of data_clean and is used to store the results of the log transformation and of normalization/standardization
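The hand-off from data_clean to df_pre might look like the sketch below. Which columns are log-transformed is not stated in the text, so log_columns is a hypothetical choice, and log1p (log of 1 + x) is used here as an assumption because it is safe for zero values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data_clean`; values are invented for illustration.
data_clean = pd.DataFrame({
    "length_of_service": [1, 3, 10, 20],
    "age": [25, 30, 41, 55],
})

# Work on a copy so data_clean keeps the untransformed values.
df_pre = data_clean.copy()

# Hypothetical selection of skewed columns to transform.
log_columns = ["length_of_service"]
df_pre[log_columns] = np.log1p(df_pre[log_columns])
print(df_pre)
```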

Normalization

Machine Learning Modelling and Evaluation

Split Train & Test

Random Forest

K-Nearest Neighbors

Logistic Regression

Decision Tree

XGBoost

Hyperparameter Tuning

Decision Tree Tuning Hyperparameter

XGBoost Tuning Hyperparameter

Predictions

Predict with xgb_best_model from XGBoost